Highly accurate whole-genome imputation of SARS-CoV-2 from partial or low-quality sequences

Abstract Background: The current SARS-CoV-2 pandemic has emphasized the utility of viral whole-genome sequencing in the surveillance and control of the pathogen. An unprecedented ongoing global initiative is producing hundreds of thousands of sequences worldwide. However, the complex circumstances in which viruses are sequenced, along with the demand for urgent results, cause a high rate of incomplete and, therefore, useless sequences. Viral sequences evolve in the context of a complex phylogeny, and different positions along the genome are in linkage disequilibrium. Therefore, an imputation method would be able to predict missing positions from the available sequencing data. Results: We have developed the impuSARS application, which takes advantage of the enormous number of SARS-CoV-2 genomes available, using a reference panel containing 239,301 sequences, to impute missing data in viral genomes. ImpuSARS was tested in a wide range of conditions (continuous fragments, amplicons, or sparse individual positions missing), showing great fidelity when reconstructing the original sequences and recovering the lineage with 100% precision for almost all lineages, even in very poorly covered genomes (<20%). Conclusions: Imputation can improve the pace of SARS-CoV-2 sequencing production by recovering many incomplete or low-quality sequences that would otherwise be discarded. ImpuSARS can be incorporated into any primary data processing pipeline for SARS-CoV-2 whole-genome sequencing.


Background
SARS-CoV-2 is a non-segmented, single-stranded RNA virus with a genome of ∼30 kb. It is classified, together with HCoV-OC43, HCoV-HKU1, SARS-CoV-1, and MERS-CoV, in the genus Betacoronavirus. SARS-CoV-2 was first described in Wuhan, China, in December 2019 and is responsible for COVID-19, which was declared a pandemic by the World Health Organization (WHO) in March 2020 [1]. Whole-genome sequencing (WGS) has been successfully used for classification [2], studying transmission dynamics [3], and evaluating global and regional patterns of pandemic spread [4]. WGS also has the potential to study reinfections, which have been described in a number of patients [5], and has recently gained prominence for characterizing viral variants that may escape the neutralizing activity of the antibodies elicited by vaccines [6]. Unfortunately, WGS results, especially in complex scenarios like this pandemic, are often imperfect, yielding incomplete viral sequences with significant regions of the genome poorly covered [7]. In fact, current systems for viral lineage identification, a highly relevant step for the control of potentially harmful strains, fail to provide a lineage assignment if a given percentage (typically >50%) of the viral sequence is missing [8]. Given the short response times required in clinics, resequencing low-quality results is frequently not an option. Therefore, alternatives for improving sequencing results that are used in other fields, such as genotype imputation, would be extremely useful in this scenario as well. Genotype imputation has traditionally been a crucial component of genome-wide association studies, increasing the power of the findings, helping in their interpretation, and facilitating further meta-analysis [9]. Genotype imputation relies on the existing correlation between genetic variations or mutations at sites across the genome of an organism [10]. Using this correlation, imputation methods accurately assign genotypes at untyped markers, improving genome coverage [10-14].
The accuracy of this imputation process improves as the number of haplotypes in the reference panel of sequenced genomes increases [15,16], especially for mutations present at low frequencies (minor allele frequency <0.5%). In the case of human genomes, the Haplotype Reference Consortium panel, composed of ∼32,000 individuals, is considered a large panel, able to reach accurate imputation for mutations with frequencies as low as 0.1-0.5% [14]. In the case of SARS-CoV-2, the outstanding international sequencing effort has generated, in a short time span, a genomic database 10 times larger. Despite the interest in viral WGS studies and the fact that viral sequences are typically imperfect, with positions and regions missing, imputation has, with a few exceptions [17,18], scarcely been used for viruses, probably because resequencing has been the more practical solution. However, in scenarios in which sampling is logistically complex or takes place under emergency conditions, like the SARS-CoV-2 pandemic, imputation may play a relevant role.
In addition, because WGS may not be routinely available to clinical laboratories, protocols for partial sequencing of the SARS-CoV-2 genome, or even partial sequencing of the Spike gene, where most of the determinants for variant characterization are located, are becoming available [19]. Given the importance of sequencing viral whole genomes for epidemiologic surveillance purposes, as stressed by the WHO [20] and the European Parliament [21], a tool for genotype imputation in SARS-CoV-2 would increase sequencing throughput by recovering many sequences that are discarded for low quality but still contain valid information for lineage or clade assignment. Similarly, sequencing kits that cover only some key stretches already miss, or will miss in the future, relevant mutations. Imputation may predict the existence of these variants of interest (VOI) or variants of concern (VOC) because of their linkage disequilibrium (LD) with resolved parts of the viral genome. Here, a fully tested, highly accurate reference panel and tool for the imputation of SARS-CoV-2 whole-genome sequences from incomplete or partial sequences is presented.

SARS-CoV-2 Imputation
Imputation of SARS-CoV-2 sequences (impuSARS) was performed using the Minimac software (Minimac, RRID:SCR_009292) [14]. Although Minimac was originally designed for human samples with diploid genotypes, the tool can impute haploid genomes such as that of SARS-CoV-2 because it supports imputation of the non-pseudoautosomal regions of chromosome X in human males. The reference panel was built with Minimac3, whereas Minimac4 was used for imputation. Minimac4 provides imputation quality comparable to that of Minimac3 but reduces memory usage and computational cost. The impuSARS tool accepts either FASTA sequence or variant (VCF) input. A FASTA sequence can include missing regions (either absent or marked as N), which will then be imputed. FASTA input is aligned to the reference with Muscle [22] to retrieve mutation positions. VCF input should include both mutant and reference genotypes when available.
The initial reference panel was created from the SARS-CoV-2 sequences available in the Global Initiative on Sharing All Influenza Data (GISAID) [23,24] (downloaded on 7 January 2021). Only sequences with >29 kb and <1% missing bases were kept ("complete" and "high coverage" tags in GISAID, respectively). Sequences were converted to a multi-sample VCF format so that only mutation positions are processed. As defined by GISAID, the hCoV-19/Wuhan/WIV04/2019 sequence (accession No. EPI_ISL_402124) was considered the official reference sequence. From this multi-sample VCF, unique mutations, i.e., mutations private to a single sequence, were discarded. After filtering, the final reference panel contained 239,301 sequences. The parameter estimation for the reference panel was precomputed with Minimac (version 3) to speed up the imputation process (reference panel provided in M3VCF format). This reference panel is periodically updated to incorporate novel variants, especially VOIs and VOCs. The latest reference panel (v3.0) was generated in July 2021; it includes >900,000 sequences and extends to other mutation types such as small indels.
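The private-mutation filter described above can be sketched in a few lines. This is a minimal illustration rather than the impuSARS implementation: the function name `drop_private_mutations`, the dictionary layout, and the reading of "private" as "observed in exactly one sequence of the panel" are assumptions made for the example.

```python
from collections import Counter


def drop_private_mutations(panel):
    """Remove mutations carried by a single sequence only.

    panel: dict mapping a sequence ID to a set of mutations, each mutation
    being a (position, ref, alt) tuple (hypothetical layout for this sketch).
    Returns a new dict with the same keys and only the shared mutations.
    """
    # Count in how many sequences each mutation appears.
    counts = Counter(m for mutations in panel.values() for m in mutations)
    shared = {m for m, n in counts.items() if n > 1}
    return {seq_id: mutations & shared for seq_id, mutations in panel.items()}


# Toy panel: (241, "C", "T") is shared, (100, "A", "G") is private to seq_a.
toy_panel = {
    "seq_a": {(241, "C", "T"), (100, "A", "G")},
    "seq_b": {(241, "C", "T")},
}
filtered_panel = drop_private_mutations(toy_panel)
```

Private mutations carry no LD information with any other site in the panel, which is why removing them does not reduce what can be imputed.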
Once imputation has been performed using the reference panel, impuSARS retrieves the imputed consensus sequence produced by bcftools consensus v1.11 [25]. The lineage associated with each imputed consensus sequence is then obtained with PANGOLIN v1.10.2 [8]. PANGOLIN assigns a detailed lineage identifier to each sequence on the basis of a multinomial logistic regression model [26]. PANGOLIN classifies sequences along a hierarchical tree reflecting evolutionary events. Each level of the hierarchical tree gathers a group of sequences with common evidence associated with an epidemiological event (usually related to new variations), which could produce an emerging edge of the pandemic [26]. Lineages that become important at the lowest levels of the phylogeny are retagged with aliases to avoid unbounded growth of the hierarchical tree, thus keeping it compacted to 4 levels at most. Finally, although impuSARS was originally designed for SARS-CoV-2 imputation, the tool can also be used to impute other viral genomes if required. For this purpose, impuSARS includes a complementary tool that lets users create a customized reference panel from a set of sequences. Custom reference panels can then be used by impuSARS to impute other partial genomes. In that case, PANGOLIN lineage assignment is disabled because it is specific to SARS-CoV-2.

Validation procedure
SARS-CoV-2 imputation was evaluated using a 10-fold cross-validation process. The dataset was randomly partitioned into 10 test subsets. For each test subset, the imputation panel was computed from the remaining 9 subsets (training subsets). Initially, the loss of genomic regions was simulated by progressively increasing the percentage of the missing genome in 10% intervals. Three different strategies were used to select these missing regions: (i) random selection of only 1 missing region (continuous block), (ii) random selection of mutation positions (missing sites), and (iii) random selection of amplicon regions that are usually independently amplified in SARS-CoV-2 sequencing (missing discontinuous blocks). Amplicon regions were defined by the hCoV-2019/nCoV-2019 v3 Amplicon Set [29] recommended by the ARTIC network [30]. Missing regions for amplicons were simulated as percentages of amplicons completely uncovered. The whole learning-testing procedure was repeated 3 times to reduce the bias produced by the random selection. Additionally, imputation was validated by iteratively removing a sliding window of 3 kb (∼10% of the entire genome) in 1.5-kb steps. This process allows identification of hot spot regions in the SARS-CoV-2 genome that are harder to impute if missed.
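The sliding-window removal can be illustrated with the following sketch. It is a hedged example, not the code used in the study: the function names, the 1-based inclusive coordinates, and the handling of the last (possibly shorter) window are assumptions.

```python
def sliding_windows(genome_length, window=3000, step=1500):
    """Yield 1-based inclusive (start, end) windows covering the genome.

    The final window is clipped at the genome end (an assumption of this
    sketch; the source does not state how the last window was handled).
    """
    start = 1
    while start <= genome_length:
        end = min(start + window - 1, genome_length)
        yield (start, end)
        if end == genome_length:
            break
        start += step


def mask_window(mutation_positions, window):
    """Split mutation positions into (known, masked) for one missing window."""
    start, end = window
    known = [p for p in mutation_positions if not start <= p <= end]
    masked = [p for p in mutation_positions if start <= p <= end]
    return known, masked


# Toy example on a 6-kb stretch with three arbitrary mutation positions.
windows = list(sliding_windows(6000))
known, masked = mask_window([241, 3037, 23403], (3001, 6000))
```

In a validation loop, the `masked` positions would be hidden from the imputation input and later compared with the imputed genotypes, one window at a time.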
After validating imputation with several random selections, 2 additional real-world scenarios were considered: (i) imputation from regions covered by the genotyping assay kit DeepChek®-8-plex CoV-2 [31] and (ii) imputation only from mutations belonging to the Spike protein (S) region. As above, a 10-fold cross-validation process was implemented in both cases. The genotyping assay covers several selected regions that represent ∼20% of the entire SARS-CoV-2 genome; hence imputation can provide a more comprehensive, improved result. The S protein, in turn, is one of the most commonly sequenced regions of SARS-CoV-2 given its crucial role in receptor recognition and cell membrane fusion [32,33]. Moreover, mutations in Spike have been related to transmissibility and to the ability to evade the host immune response [34]. Therefore, the ability to impute the entire SARS-CoV-2 genome from the Spike region can benefit subsequent lineage classification and is thus important for epidemiological surveillance.
To facilitate the interpretation of the results, the precision, recall, and F1 scores were computed. Because this is a heavily unbalanced problem (far fewer mutations than reference positions), the Matthews correlation coefficient (MCC) and balanced accuracy (BACC) scores, which are better suited to such scenarios [35-37], are also provided. For these scores, positions with mutations in each real sequence are considered positive, whereas reference positions are negative. Therefore, correctly imputed mutations and reference positions are considered true-positive and true-negative results, respectively. Conversely, wrongly imputed mutations and wrongly imputed reference nucleotides are counted as false-positive and false-negative results, respectively. Thus, recall is the true-positive rate, whereas precision is the positive predictive value. The F1-score is the harmonic mean of these 2 metrics. The MCC measures the correlation and agreement between the true and the predicted labels and varies between −1 and 1, where −1 indicates complete disagreement between the predicted and true labels; 0, an average random prediction; and 1, a perfect prediction. Finally, the balanced accuracy is the arithmetic mean of sensitivity and specificity.
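As a worked illustration of these definitions, the sketch below computes the 5 scores from the 4 confusion-matrix counts using their standard formulas. The function name `imputation_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions of this example, not details taken from the study.

```python
import math


def imputation_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1, MCC, and BACC from confusion counts.

    tp/fn: real mutations imputed correctly / missed.
    tn/fp: reference positions imputed correctly / imputed as mutations.
    Undefined ratios (zero denominator) are returned as 0.0 by convention.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    bacc = (recall + specificity) / 2
    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc, "bacc": bacc}


# Toy counts: 10 real mutations (8 recovered) among 100 positions.
scores = imputation_metrics(tp=8, fp=2, tn=88, fn=2)
```

With these toy counts, precision, recall, and F1 all equal 0.8, whereas the MCC is 700/900 ≈ 0.778, which shows how the MCC also accounts for the negative (reference) class.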

Lineage classification
Imputations from the simulated genotyping assay and Spike region test subsets were also evaluated in terms of the lineage assigned to the imputed sequences. A standard accuracy metric was calculated to compare the lineages assigned to imputed sequences against the real lineages of the original GISAID sequences. Additionally, 2 baseline models were implemented to evaluate the influence of known mutations versus missing ones on lineage assignment. The first baseline model simply filled missing regions with the SARS-CoV-2 reference sequence. The second model randomly generated genotypes at the missing mutation positions of the entire test subset, weighting probabilities by the original genotype frequencies in the training datasets. For comparison purposes, lineages were also obtained for the sequences resulting from these 2 baseline models.
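The 2 baseline models can be sketched as follows. This is an illustrative simplification under stated assumptions: genotypes are encoded as 0 (reference) or 1 (mutant), sites are treated as biallelic, and the names `reference_fill`, `random_fill`, and `alt_freq` are hypothetical.

```python
import random


def reference_fill(known, missing_sites):
    """Baseline 1: every missing site receives the reference allele (0)."""
    filled = dict(known)
    for site in missing_sites:
        filled[site] = 0
    return filled


def random_fill(known, missing_sites, alt_freq, rng):
    """Baseline 2: draw each missing genotype at random, weighted by the
    mutant-allele frequency of that site in the training data.

    alt_freq: dict mapping site -> mutant frequency in [0, 1]; sites absent
    from the dict are treated as never mutated (assumption of this sketch).
    """
    filled = dict(known)
    for site in missing_sites:
        filled[site] = 1 if rng.random() < alt_freq.get(site, 0.0) else 0
    return filled


# Toy example: one known mutation and two missing sites.
known = {241: 1}
missing = [3037, 23403]
ref_result = reference_fill(known, missing)
rand_result = random_fill(known, missing, {3037: 1.0, 23403: 0.0},
                          random.Random(0))
```

Neither baseline uses LD between sites, so any gain of imputation over them reflects information carried by the known positions of the same sequence.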

Imputation test with independent datasets
After the entire validation process, the final reference panel including the 239,301 GISAID sequences was built. Several independent datasets were considered for this test phase using the definitive reference panel: (i) new GISAID sequences not included in the reference panel and belonging to lineages of interest; (ii) 8 samples sequenced at the Hospital San Cecilio (Granada, Spain) using both the DeepChek®-8Plex-CoV2 genotyping array [31] and WGS as described below; and (iii) 1 sample, assigned to B.1.351 (β-variant) [38] by an experimental RT-PCR kit, subjected to WGS that resulted in an incomplete whole-genome sequence, at Hospital Virgen del Rocio (Seville, Spain).
In the first test, new GISAID sequences from highly relevant lineages such as B.1.1.7 (α-variant) [39] and B.1.351 (β-variant) [38] were selected: 64,398 and 970 sequences, respectively (downloaded on 23 February 2021). As in the previous validation phase, these sequences were also tested by iteratively removing a 3-kb window sliding in 1.5-kb steps along the entire genome. In this way, the importance of specific regions for imputing relevant lineages could be evaluated. In the second test, the variants obtained with the genotyping array were used to impute the entire genome, and the assigned lineages were compared against the whole-genome results. Finally, the imputation tool was used in a third test to solve a real case in which an experimental research use only (RUO) test warned of a potential VOC but the confirmatory WGS was of poor quality, in a scenario where a quick, informed decision was required. The poor-quality sequence was then used to impute the whole-genome sequence and lineage. The outcome of this case demonstrates the level of resolution and accuracy of the imputation procedure presented here.

Genotyping array and whole-genome sequencing of viral samples
Eight SARS-CoV-2 nasopharyngeal samples were sequenced following the manufacturer's DeepChek®-8Plex-CoV2 genotyping array protocol [31]. WGS of the same samples was carried out following the ARTIC protocol [30] with the hCoV-2019/nCoV-2019 v3 Amplicon Set [29]. Whole-genome samples were sequenced on an Illumina NextSeq 500 sequencer with 150-bp paired-end reads and a total of ∼500,000 reads per sample.

Sequence data preprocessing
Sequencing data (150 bp ×2) were analyzed using in-house scripts and the nf-core/viralrecon pipeline [40]. Briefly, after read quality filtering, sequences for each sample were aligned to the SARS-CoV-2 isolate Wuhan-Hu-1 reference genome (MN908947.3) using Bowtie 2 (Bowtie, RRID:SCR_005476) [41], followed by primer sequence removal and duplicate read marking using the iVar [42] and Picard (Picard, RRID:SCR_006525) [43] tools, respectively. Genomic mutations were identified with iVar, using a minimum allele frequency threshold of 0.25 for calling mutations and a filtering step to keep only mutations with a minimum allele frequency of 0.75. Using this set of high-confidence mutations and the MN908947.3 genome, a consensus genome per sample was finally built using iVar.
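The 2-step allele frequency filter can be pictured with the short sketch below. It does not reproduce iVar or its output format: the record layout (a dict with an `alt_freq` field), the function name, and the use of inclusive thresholds are assumptions made for illustration.

```python
def high_confidence_mutations(calls, call_min=0.25, keep_min=0.75):
    """Apply the calling threshold, then the high-confidence filter.

    calls: list of dicts with at least an 'alt_freq' key (hypothetical
    layout). Thresholds are treated as inclusive, which is an assumption.
    """
    called = [c for c in calls if c["alt_freq"] >= call_min]
    return [c for c in called if c["alt_freq"] >= keep_min]


# Toy calls: only the last one passes both thresholds.
toy_calls = [
    {"pos": 241, "alt_freq": 0.10},    # below the calling threshold
    {"pos": 3037, "alt_freq": 0.40},   # called, but not high confidence
    {"pos": 23403, "alt_freq": 0.98},  # kept for the consensus
]
kept = high_confidence_mutations(toy_calls)
```

Mutations between the 2 thresholds are called but excluded from the consensus, so only well-supported alleles enter the sequence that is later imputed.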

Imputation of randomly simulated missing regions
Each of the 10 test subsets in the 10-fold cross-validation was reduced by randomly simulating missing regions in increasing percentages (10-90%). This process was repeated 3 times for each missing percentage. Classification metrics (MCC, BACC, and F1-score) were obtained for each reduced test dataset, as shown in Fig. 1A for 1 random region (missing continuous blocks), Fig. 1B for randomly selected mutations (missing sites), and Fig. 1C for randomly selected amplicons (missing discontinuous blocks). In all cases, the mean values of the imputation performance metrics were >0.65 even in the worst scenario (imputing from only 10% of the genome). Imputation progressively improves as the known percentage of the sequence increases, reaching mean values >0.95 for tests with 90% of the genome known. Interestingly, the performance metrics showed higher dispersion (including some low outliers) when imputing only 10% of the genome in 1 continuous block (Fig. 1A), whereas for missing mutations and discontinuous blocks the dispersion was more marked at the opposite end of the range, with 90% missing (Fig. 1B and C). This behavior might be explained by the fact that a single small random block left to impute can fall in a region where mutations are rare and harder to impute, even when the remaining 90% is known. The imputation by missing sliding windows presented in the next section helps to confirm this hypothesis. Finally, even for extremely high missing percentages, such as those of the genotyping assay (∼80%) or the Spike-only scenario used below, the obtained metrics suggest a reasonably accurate imputation.

Effects of missing specific locations
As previously noted, imputation performance is strongly associated with which region of the SARS-CoV-2 genome lacks coverage. Therefore, the importance of selecting adequate regions when sequencing SARS-CoV-2 samples, and its influence on the subsequent imputation of the remaining regions, is analyzed here. For this purpose, a 3-kb window was iteratively removed and imputed along the entire genome, repeating the process in 1.5-kb steps. For the sake of clarity, only key metrics (precision, recall, and MCC) of each imputed window along the entire genome are shown in Fig. 2. The additional metrics, BACC and F1-score, are available in Supplementary Fig. S1. Four hot spot regions were identified as critical positions where mutations are harder to impute when the surrounding block is missing. More specifically, uncovered regions around positions 3k, 12k, 16.5k (ORF1ab protein, replicase polyprotein 1ab), and 24k (S protein, Spike glycoprotein) would slightly reduce imputation ability. As previously suggested, these hot spots are strongly associated with regions where mutations are less frequent in the reference panel (dashed green line). Recall values tend to be lower than precision because of the private mutations in the variants, which are virtually impossible to impute because of the lack of information on LD with other mutations. This is not a problem specific to impuSARS but a general drawback of any imputation method or strategy.

Imputation from genotyping assay and spike regions
Once the robustness of the imputation in different missing-region scenarios had been validated, the focus was set on validating the imputation of genomes using only data from the genotyping assay regions previously described or from the Spike protein region. Table 1 shows imputation performance metrics for both cases per test subset. These metrics were also calculated against the frequency of the imputed mutations in the reference panel (Fig. 3). In both cases, only the representative metrics precision, recall, and MCC were kept. Detailed results for the other metrics (BACC and F1-score) can be found in Supplementary Table S1 and Supplementary Fig. S2. As shown in Table 1, the imputation performance surpasses 0.81 in the 3 averaged metrics, with precision being the highest at >0.96 for both regions, while recall remains at 0.86 and 0.81 for the genotyping assay and Spike regions, respectively. Regarding Fig. 3, imputation performance quickly increases to >0.96 in the 3 metrics (recall, precision, and MCC) for mutations with frequencies >0.01 and >0.03 for the genotyping array and Spike region imputations, respectively. The imputation from genotyping array sequences reaches its maximum values (>0.996) at frequencies >0.33 for the precision and recall metrics, whereas MCC slightly decreases to 0.895 beyond the same frequency threshold. For imputation from the Spike region, an improvement is also observed at mutation frequencies >0.33, reaching performance values of 0.998 and 0.969 for recall and precision, respectively, but a more drastic decrease is observed in MCC. This MCC decrease is correlated in both cases with the decrease in the number of mutations (green line). As mutation frequency increases, fewer mutations are found and the datasets become unbalanced in the opposite direction (more mutant than reference positions), an imbalance that the MCC captures better than the other metrics. Nevertheless, imputing positive cases (mutations) is more relevant in those situations, so the recall and precision results are more informative.

Lineage classification
The previously imputed mutations for the simulated genotyping array and Spike region subsets were used to rebuild the consensus whole-genome sequences and assign their corresponding lineages with PANGOLIN. The quality of the imputed lineage was measured by the accuracy metric against real lineages and compared to 2 baseline models (Fig. 4). Briefly, these 2 models filled missing regions with random mutations assigned by frequency ("Random fill") or with nucleotides from the reference sequence ("Reference fill"), respectively (see Material and Methods section for details). Accuracy was also calculated for the different levels of the hierarchical tree of PANGOLIN lineages. As shown, the first level in the hierarchical classification of the lineage was almost always correctly determined (>98%), even by the 2 baseline models. That is, the information provided by the already known regions (genotyping array and Spike protein) was enough to classify this first level. However, the imputed solution becomes more relevant as lower levels have to be determined. Hence, imputation clearly outperformed both baseline methods when lineages were assigned at the third and fourth levels, achieving 77% and 68% accuracy for the genotyping array and Spike regions, respectively. As expected, imputation from the genotyping array positions yields higher lineage accuracies than imputation from Spike because this kit was specifically designed to capture relevant regions in the SARS-CoV-2 genome. Even so, imputation still produces strong benefits in lineage assignment for the genotyping array regions, clearly improving on the lineage assignments obtained with the simple baseline models.
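One plausible way to compute accuracy per hierarchical level is to truncate each lineage name to its first components and compare the truncated names. The sketch below follows that reading; it is an assumption, because the exact per-level scoring rule is not spelled out here, and the function names are hypothetical.

```python
def lineage_prefix(lineage, level):
    """Return the first `level` dot-separated components of a lineage name,
    e.g. lineage_prefix("B.1.1.7", 2) gives "B.1"."""
    return ".".join(lineage.split(".")[:level])


def accuracy_by_level(true_lineages, imputed_lineages, max_level=4):
    """Fraction of sequences whose imputed lineage matches the true one
    when both names are truncated to each level from 1 to max_level."""
    total = len(true_lineages)
    result = {}
    for level in range(1, max_level + 1):
        hits = sum(
            lineage_prefix(t, level) == lineage_prefix(i, level)
            for t, i in zip(true_lineages, imputed_lineages)
        )
        result[level] = hits / total
    return result


# Toy example: the second sequence is resolved only down to B.1.
true_lineages = ["B.1.1.7", "B.1.351", "B.1.177"]
imputed_lineages = ["B.1.1.7", "B.1", "B.1.177"]
per_level = accuracy_by_level(true_lineages, imputed_lineages)
```

In this toy case, levels 1 and 2 are fully correct while levels 3 and 4 drop to 2 of 3, mirroring the pattern in which coarse levels are easy to recover and finer levels depend on the imputed positions.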
Additionally, a detailed view of lineage classification for the most frequent lineages (>500 sequences) is shown in Fig. 5. As noted, some lineages are more commonly misclassified. For instance, several sequences are wrongly classified as B.1.1.119 when imputing from the genotyping array regions. Similarly, lineage B.1 is frequently assigned when sequences truly belong to a more specific lineage (lower level in the hierarchical tree) in the imputation from Spike. In the first case, this misclassification is produced by the fact that lineage B. 1 (80%, 75%, 87%, and 57%, respectively). Although the percentages of misclassification are quite high in these cases, the affected lineages are less relevant for prospective imputation purposes because they belong to early phases of the virus's evolution, carry less informative mutations, are not classified as VOI or VOC, and in some cases are almost or already extinct. In contrast, VOCs like α and β were more accurately classified when imputing from both the genotyping array (78.3% and 99% accuracy, respectively) and the Spike region (78.3% and 98.1%).

Imputation of new independent datasets
Previous sections have extensively validated the proposed imputation system under several configurations and strategies. This section shows several use cases and test results obtained with independent datasets using the final imputation reference panel (239,301 sequences).
First, 2 recently emerging lineages, B.1.1.7 (α-variant) and B.1.351 (β-variant), were studied in this final testing phase to evaluate the performance of the imputation in new lineages. Sequences recently added to GISAID (not included in our present reference panel) under these lineages were selected: 64,398 and 970 sequences, respectively. The percentage of correctly classified lineages after imputation when a 3-kb window (10%) is missing along the entire genome was then calculated (Fig. 6).
As shown in Fig. 6, even though these lineages are underrepresented in the present reference panel (23 and 105 sequences, respectively), the methodology has captured the LD structure with such precision that it can accurately impute the B.1.1.7 and B.1.351 lineages from other sequences. Specifically, both lineages obtained 100% accuracy for almost any missing 3-kb region. The imputation accuracy was slightly reduced in the α-variant (B.1.1.7) when the missing regions were located around the center of the S protein (99.5% accuracy) or at the ORF8 and N proteins (99% accuracy). This behavior is clearly associated with the loss of constitutive mutations of the α-variant such as N501Y, A570D, or P681H, among others [44]. In the case of the β-variant (B.1.351), performance slightly decreased at the beginning of the S protein (99.5%), as well as around the E and M proteins (99.8%). Again, these small decreases are linked to important mutations characteristic of the lineage, such as Q57H or P71L [45].

Imputation for sequencing kits and low-quality sequences
Eight SARS-CoV-2 samples were sequenced using the DeepChek®-8-plex CoV-2 genotyping array (see Table 2). The partial sequences, covering ∼20% of the whole viral genome, were used to impute the remaining non-covered 80% of the genome with impuSARS. Then, the same samples were subjected to WGS. The imputed and the sequenced whole genomes, together with their lineages, were subsequently compared against each other, showing a highly reliable sequence imputation and 100% successful lineage imputation. FASTQ files as well as consensus whole-genome sequences for both the genotyping array and WGS of these 8 samples are available for download at the European Nucleotide Archive (ENA) under the accession ID PRJEB43882. Imputation results (both imputed consensus whole-genome sequences and lineages) are provided in the Zenodo repository [46]. The coverage distribution of the initial genotyping array results is provided in Supplementary Fig. S3. The 3 main quality metrics and the imputed lineages are shown in Table 2. A more detailed table including mutation counts and additional metrics is provided in Supplementary Table S2.
To further illustrate the usefulness of the imputation system in a real clinical scenario, a use case from the Hospital Virgen del Rocio is described. In a routine survey, a sample was analyzed by RT-PCR using a RUO kit (see Material and Methods section for details), which raised a warning suggesting that it might belong to the emerging β-variant (B.1.351), a VOC. The sample was immediately submitted to confirmatory WGS, which resulted in poor-quality sequencing, with only 28.91% of the SARS-CoV-2 genome covered, 71 amplicons completely uncovered, and 3 covered at low depth (<20×). Lineage assignment with current tools like PANGOLIN is impossible in this low-quality scenario. However, it was urgent to confirm or rule out the presence of a VOC for epidemiologic surveillance and medical decision making. Therefore, impuSARS was applied to this poor-quality sequence and lineage assignment was carried out with PANGOLIN, producing a B.1.1.7 (α) lineage assignment, also a VOC but currently more widespread in Spain. Detailed analysis of the pattern of available mutations also supported this lineage assignment (see Table 3).

Indels imputation
As shown in previous sections, impuSARS was originally designed and validated for imputation of single-nucleotide polymorphisms (SNPs). In fact, SNPs clearly represent most mutations in SARS-CoV-2 sequences, with >33 SNPs per sequence against only 3.12 deletions and almost no insertions (0.4 on average) (see frequencies in Supplementary Fig. S4). However, emerging VOCs are progressively incorporating more indels of interest, mainly short deletions of 1-3 codons (3-12 nucleotides). This is the case, for example, for the 2-codon deletion S:69-70del in the α-variant, the 3-codon deletion ORF1a:3675-3677del in the β-variant, or, more recently, the deletion S:157-158del in the δ-variant. Consequently, impuSARS has recently been updated to accept and impute short indels by designing a new reference panel (v3.0).
Although it is outside the scope of this article, indel imputation has been successfully validated with the most representative indels, such as those mentioned above. In fact, indel imputation has also produced significant improvements in lineage classification.

Figure 1: Imputation performance metrics (precision, recall, F1-score, MCC, and BACC) depending on missing genome percentage. (A) One random continuous block of the genome; (B) random selection of missing variants; (C) random selection of missing amplicons. In the box plots, the box spans the 2 quartiles around the median, which is represented by the horizontal line in the box, and the whiskers represent the maximum and minimum values. Dots outside these limits are outliers.

Figure 2: Imputation performance metrics (precision, recall, and MCC) based on the position of a missing 3-kb window along the SARS-CoV-2 genome. Left y-axis values represent variant frequencies (dashed green line). SARS-CoV-2 protein regions are represented by colored backgrounds, with names specified at the top.

Figure 3: Principal imputation performance metrics (precision, recall, and MCC) calculated depending on imputed variant frequencies. (A) Imputation quality when imputing from the genotyping array positions; (B) imputation quality when imputing from Spike protein positions. Left y-axis (green) represents the number of variants for those frequency thresholds (log scale).

Figure 4: Lineage classification accuracy compared against 2 baseline models. (A) Lineage accuracy when imputing from the genotyping array positions; (B) lineage accuracy when imputing from the Spike protein region. Levels represent lineage specification.

Figure 5: Accuracy obtained for each pair of lineages (real vs imputed) for the most frequent lineages (>500 sequences). The left heat map represents the values obtained for genotyping array imputation, whereas the right heat map represents accuracies for imputation from the Spike protein region. Color represents the percentage of sequences in each real lineage classified into each imputed lineage (the darker, the higher).

Table 1: Performance metrics (recall, precision, and MCC) obtained for 10-fold cross-validation subsets when imputing from the genotyping assay and Spike protein regions. Values are calculated for the entire test subset imputation.

Table 2: Variant imputation metrics (precision, recall, and MCC) and lineage classification for 8 independent samples internally sequenced with both the genotyping array and whole-genome sequencing. Values